Topic Models

Workshop at the BIGSSS Computational Social Science Summer School on Democratic Debate

Christian S. Czymara & Maximilian Weber

7/7/23

Preamble

  • Open this Colab and run the first chunk (“First: Install all packages”) to save time

Agenda

  • What are Topic Models and what do you use them for?
  • Pre-processing
  • How do Topic Models work?
  • Advantages and limitations of Topic Models
  • Exercise: What topics do the main characters in South Park talk about?

Basics

Logic of topic models

  • Topic modeling is a class of algorithms for finding the most important themes (topics) in a large text collection (corpus).
  • Little prior knowledge about the content needed
  • What (i.e. which topics) is written about?
  • Which topics are addressed particularly frequently?
  • Do the texts differ systematically in their content (between different persons, newspapers, over time, …)?

But first: Data preparation (pre-processing)

Town Musicians of Bremen

  • First of all, the texts have to be broken down into their components (words, punctuation marks, etc.).
  • Example: The Bremen Town Musicians by the Brothers Grimm
There was a man who had a donkey,

which had served him faithfully for many years,

but now his strength was running out,

so that he became more and more unfit for work.
  • Let’s assume that each line is a single document

  • Corpus (text collection) with four documents

Vocabulary

  • The texts are divided into their components
library(quanteda)

dok1 <- "There was a man who had a donkey,"

dok2 <- "which had served him faithfully for many years,"

dok3 <- "but now his strength was running out,"

dok4 <- "so that he became more and more unfit for work."

bremstmus <- c(dok1, dok2, dok3, dok4)

corp_bremstmus <- corpus(bremstmus)
toks_bremstmus <- tokens(corp_bremstmus)

Vocabulary

toks_bremstmus
Tokens consisting of 4 documents.
text1 :
[1] "There"  "was"    "a"      "man"    "who"    "had"    "a"      "donkey"
[9] ","     

text2 :
[1] "which"      "had"        "served"     "him"        "faithfully"
[6] "for"        "many"       "years"      ","         

text3 :
[1] "but"      "now"      "his"      "strength" "was"      "running"  "out"     
[8] ","       

text4 :
 [1] "so"     "that"   "he"     "became" "more"   "and"    "more"   "unfit" 
 [9] "for"    "work"   "."     

Document-Feature-Matrix

  • What we need is a table (matrix)
  • … in which the texts (documents) are in the rows
  • … the words, characters, etc. (features) are in the columns
  • … and in the cells counts how often a feature occurs in the respective text
dfm_bremstmus <- dfm(toks_bremstmus)
dfm_bremstmus
Document-feature matrix of: 4 documents, 30 features (70.83% sparse) and 0 docvars.
       features
docs    there was a man who had donkey , which served
  text1     1   1 2   1   1   1      1 1     0      0
  text2     0   0 0   0   0   1      0 1     1      1
  text3     0   1 0   0   0   0      0 1     0      0
  text4     0   0 0   0   0   0      0 0     0      0
[ reached max_nfeat ... 20 more features ]

Almost done

  • Better: Keep only actual words as features (no numbers, punctuation, etc.)
toks_bremstmus_2 <- tokens(corp_bremstmus,
                           remove_punct = TRUE,
                           remove_numbers = TRUE,
                           remove_symbols = TRUE,
                           remove_separators = TRUE
                           )
  • Remove stop words (common words that carry little meaning on their own)
toks_bremstmus_2 <- tokens_remove(toks_bremstmus_2,
                                  stopwords("english"),
                                  case_insensitive = TRUE
                                  )

Almost done

  • Reduce words to their stem (stemming), e.g. “programming”, “programs”, and “programmed” all become “program”
toks_bremstmus_2 <- tokens_wordstem(toks_bremstmus_2, language = "english")
  • Create DFM
dfm_bremstmus_2 <- dfm(toks_bremstmus_2)

dfm_bremstmus_2
Document-feature matrix of: 4 documents, 12 features (75.00% sparse) and 0 docvars.
       features
docs    man donkey serv faith mani year now strength run becam
  text1   1      1    0     0    0    0   0        0   0     0
  text2   0      0    1     1    1    1   0        0   0     0
  text3   0      0    0     0    0    0   1        1   1     0
  text4   0      0    0     0    0    0   0        0   0     1
[ reached max_nfeat ... 2 more features ]

Topic Models

Assumptions

  • Word order within a text does not matter; each text is treated as an unordered collection of words (bag-of-words assumption)
  • Each text consists of a mixture of different topics (with different proportions)
  • Texts that discuss similar topics use similar words
  • In other words, each topic uses some words more frequently than others
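The bag-of-words assumption can be seen directly in a document-feature matrix: two invented sentences with opposite meanings but the same words get identical rows of counts (a minimal sketch using quanteda, as above; the sentences are made up for illustration).

```r
library(quanteda)

# Word order is discarded: both sentences map to the same row of counts
toks <- tokens(c("man bites dog", "dog bites man"))
dfm(toks)
```

Both rows of the resulting matrix contain a 1 for each of "man", "bites", and "dog", so the two documents are indistinguishable to a topic model.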

The algorithm

  • Two steps:
  • Finding out which words occur together
  • Checking how these words are distributed among the texts
  • Unsupervised machine learning: no labeled training data or other information about the texts is required

The algorithm

  • Must be specified in advance: how many topics should be found?
  • Iterative process designed to maximize two goals simultaneously:
  • Words that occur together frequently are more likely to belong to the same topic
  • Words in the same document are more likely to belong to the same topic
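The first step, counting which words occur together, can be made concrete with quanteda's feature co-occurrence matrix. A toy sketch with three invented mini-documents on two themes:

```r
library(quanteda)

# Three invented mini-documents on two themes
txt <- c("taxes shape the budget",
         "the budget deficit and taxes",
         "climate policy and energy policy")

toks <- tokens_remove(tokens(txt, remove_punct = TRUE),
                      stopwords("english"))

# Co-occurrence counts within documents: "taxes" pairs with "budget",
# "climate" with "policy" -- two word clusters, i.e. candidate topics
fcm(toks, context = "document")
```

The topic model algorithm exploits exactly this kind of co-occurrence structure, iteratively, across the whole corpus.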

Topic Model in action

Kling (2016): Topic Modelling Portal

Implementation

  • Arguably most convenient: stm package for R by Roberts, Stewart, and Tingley (2019).
  • Big advantage of stm: document-level covariates can be included (e.g., newspaper, time of a tweet, gender of a respondent)
  • Second step (after topic model):
  • Regression with documents as units of analysis
  • Topic frequencies as dependent variables
  • Text properties as predictors (see ?estimateEffect)
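A sketch of this two-step workflow, using quanteda's built-in corpus of US inaugural addresses; K = 5 topics and the Year covariate are arbitrary choices for illustration, not a recommendation:

```r
library(quanteda)
library(stm)

# Pre-process the built-in inaugural-address corpus
dfmat <- dfm(tokens(data_corpus_inaugural, remove_punct = TRUE))
dfmat <- dfm_remove(dfmat, stopwords("english"))
dfmat <- dfm_trim(dfmat, min_docfreq = 5)  # drop rare features for speed

# Convert to stm's input format: documents, vocab, and meta (the docvars)
out <- convert(dfmat, to = "stm")

# Step 1: fit the topic model, with Year as a prevalence covariate
mod <- stm(out$documents, out$vocab, K = 5,
           prevalence = ~ Year, data = out$meta, verbose = FALSE)
labelTopics(mod, n = 5)  # top words per topic

# Step 2: regress topic proportions on the document-level covariate
prep <- estimateEffect(1:5 ~ Year, mod, meta = out$meta)
summary(prep, topics = 1)
```

The `summary()` output shows, for each topic, whether its prevalence changes systematically with the covariate, here whether a topic becomes more or less common over time.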

Advantages of Topic Models

  • The entire corpus is decomposed into its thematic components
  • Little prior knowledge needed, models themselves need few decisions (only number of topics)
  • Often intuitive results
  • Unsupervised: No training data needed, no hand coding
  • Exploration and description of large text data collections

Limitations of Topic Models

  • Results depend on pre-processing decisions
  • Naming topics is subjective
  • What to do with meaningless topics?
  • Analysis of all content can be overwhelming
  • Content of topics can be influenced by the number of topics, especially if the number of topics is small
  • If the categories of interest are known beforehand, it is better to classify texts into them directly (supervised ML)

Summary

  • Topic Models find out which words occur together
  • … and thus which topics are discussed in which texts
  • Very helpful for general overviews, no prior theoretical knowledge necessary
  • For more specific research questions, supervised classification often better

Exercise

Exercise

South Park

  • What topics do Cartman, Stan, Kyle and Kenny talk about in South Park? Who is swearing the most?
  • We will work with Google Colab for this exercise
  • stm and other R packages are in this Colab
  • BERTopic and other approaches in Python are in this Colab